Predicting proximal tubule failed repair drivers through regularized regression analysis of single cell multiomic sequencing

Renal proximal tubule epithelial cells have considerable intrinsic repair capacity following injury. However, a fraction of injured proximal tubule cells fails to undergo normal repair and assumes a proinflammatory and profibrotic phenotype that may promote fibrosis and chronic kidney disease. The healthy to failed repair change is marked by cell state-specific transcriptomic and epigenomic changes. Single nucleus joint RNA- and ATAC-seq sequencing offers an opportunity to study the gene regulatory networks underpinning these changes in order to identify key regulatory drivers. We develop a regularized regression approach to construct genome-wide parametric gene regulatory networks using multiomic datasets. We generate a single nucleus multiomic dataset from seven adult human kidney samples and apply our method to study drivers of a failed injury response associated with kidney disease. We demonstrate that our approach is a highly effective tool for predicting key cis- and trans-regulatory elements underpinning the healthy to failed repair transition and use it to identify NFAT5 as a driver of the maladaptive proximal tubule state.

Supplementary Figure 1.Analysis of ATAC modality in single nucleus multiomic dataset.a. Overlap between peaks called on this dataset and prior snATAC-seq dataset generated in our lab.b.Gene activity scores versus expression for all genes in dataset.Genes that were selected as top variable features in both the Gene Activity and RNA assays are colored red.c.Confusion matrix showing agreement between cell type annotations as determined by RNA marker expression and cell type predictions using Seurat's LabelTransfer method.Supplementary Figure 2. Comparison of doublet prediction algorithms.a. Left, cartoon of doublet finding approach used by AMULET.Nuclei have two copies of a given genomic locus, so barcodes with many loci with greater than 2 reads mapped are predicted to be doublets.Right, cartoon of doublet finding approach used by DoubletFinder and ArchR.Doublets are generated by averaging the expression or chromatin accessibility profiles of two barcodes in the dataset.Barcodes    Supplementary Figure 15.Comparing the use of promoters and distal CREs for modeling.a. Regulatory scores calculated for selected TFs when model is trained with motifs from all linked CREs (purple) and from motifs only in target gene's promoter region (yellow).P values shown for two-tailed t-test of the difference in mean scores.Data presented as mean estimates of regulatory scores with error bars representing standard errors, calculated by modeling n=100 bootstrapped samples.Source data are provided in the Source Data file.b.Density plot of coefficients of determination of model predictions of differentially expressed gene expression versus actual expression when training model with motifs from all linked CREs (purple, mean r 2 = 0.45) and motifs only in target gene's promoter region (yellow, mean r 2 = 0.38).P value calculated for two-tailed two sample t-test.Source data are provided in the Source Data file.c.Number of TFs per modeled gene before and after RENIN training.The initial number consists of all unique TF motifs present in a gene's linked CREs and promoter peaks.The trained number of TFs consists of the number of TFs with RENIN-predicted non-zero regulatory weights for that gene.P value calculated for two-tailed paired t-test.Analysis performed for n=1514 modeled genes.Source data are provided in the Source Data file.d.Density plot of ΔAIC for 1323 modeled genes with at least one promoter peak (MACS2-called peak within 2000bp of TSS), calculating ΔAIC as the difference between the AIC of the RENIN-trained model using TFs within all predicted linked CREs (CCAN-cis-coaccessibility network) and the AIC of the trained model using TFs within the promoter peak(s) only.Mean (gray dashed line) and median ΔAIC are -1293 and -771, respectively.1184 genes have a negative ΔAIC, while 111 have a positive ΔAIC (black line).

Supplementary Figure 18. Comparison of predictions between five modeled datasets. a.
Matrix of Jaccard indexes calculated between the top ten predicted FR-associated TFs for each dataset.Index of 1 indicates perfect overlap and index of 0 indicates no overlap.b.Matrix of Jaccard indexes calculated between the top ten predicted healthy-associated TFs for each dataset.Index of 1 indicates perfect overlap and index of 0 indicates no overlap.c.Frequency of prediction as a top ten FR-associated TF for all TFs in at least one top 10 FR list.TFs that were top FR predictions in all 5 modeled datasets are GLIS3 and NFAT5.d.Frequency of prediction as a top ten healthy-associated TF for all TFs in at least one top 10 healthy list.TFs that were top healthy predictions in all 5 modeled datasets are HNF4A and PPARA.
are predicted to be doublets by the proportion of neighboring barcodes that are simulated doublets versus original barcodes.b.Venn diagram of predicted doublets by each algorithm.c.Violin plots for number of unique and total peaks per barcode, labeled by doublet prediction by ArchR and AMULET.d.Counterclockwise from top right, WNN UMAP plots of dataset with doublet predictions by AMULET, DoubletFinder, ArchR, and by all three algorithms.Supplementary Figure 3. KIM1+/VCAM1+ PT population present in living donor biopsyderived samples.a. UMAP plot of reclustered living donor biopsy-derived-only dataset (n = 5 from 3 individual donors).b.Expression dot plot of healthy PT (CUBN) and failed repair PT (VCAM1 and HAVCR1) marker genes.c.VCAM1 expression in living donor biopsy-only dataset.d.HAVCR1 expression in living donor biopsy-only dataset.e. Heatmap of cell type marker snATAC-seq peak accessibility for each cell type by sample type-nephrectomy or living donor biopsy.
H PT DARs FR PT DARs snATAC−seq peak set Percent of peaks overlapping RPTEC ATAC peaks a b Supplementary Figure 5. RENIN performance measured by healthy and failed repair proximal tubule CREs recall of RPTEC histone modification peaks is competitive with other methods.a. From left to right, ROC curve against RPTEC H3K27ac peaks identified with CUT&RUN, calculated for FigR (FR CRE AUC: 0.622, Healthy CRE AUC: 0.552), SCENIC+ (FR: 0.559, Healthy: 0.450), Cicero (FR: 0.536, Healthy: 0.552), and scMEGA-predicted healthy-FR PT (H-FR AUC: 0.570) CREs.ROC curve for the pseudotime trajectory-based scMEGA is for predicted H-FR trajectory CREs.Dotted line indicates random performance (AUC = 0.5).Source data are provided in the Source Data file.b.From left to right, ROC curve calculated against RPTEC H3K4me3 peaks for FigR (FR: 0.637, Healthy: 0.564), SCENIC+ (FR: 0.490, Healthy: 0.428), Cicero (FR: 0.584, H: 0.587), and scMEGA-predicted healthy-FR PT (H-FR AUC: 0.513) CREs.Source data are provided in the Source Data file.The effects of increasing   on CRE predictions.a. Calculated AUCs for H3K27ac (purple) and H3K4me3 (green) peaks in RPTECs with FR (failed repair PT) and H (healthy PT) CRE predictions with increasing  " .b. Number of unique CREs predicted with increasing  " .Random number generator seeds and pseudocell matrices used were kept the same across trials.Supplementary Figure 8. From left to right, enrichment of partitioned (a) CKD heritability and (b) eGFR heritability into predicted CREs by FigR, SCENIC+, Cicero, and scMEGA.To bin model-predicted CREs into top, middle, and bottom third bins, scores calculated by each method were summed for each CRE across all of its target genes.FigR CREs were sorted by rObs, SCENIC+ CREs were sorted by R2G_importance_x_abs_rho, Cicero CREs were sorted by coaccessibility score, and scMEGA CREs were sorted by TStat.N = 7 biologically independent samples containing 50,768 cells were examined in a joint analysis.Error bars represent standard errors around estimates of enrichment by LDSC with a block jackknife over n=200 equally sized blocks of adjacent SNPs.Source data are provided in the Source Data file.Supplemental Figure 12.Visualization of gene regulatory networks predicted by RENIN by graph and simulation.a. TF node size represents centrality computed by betweenness, top 20 TFs are labeled.b.PCA plot of subsampled PT dataset, with n = 1000 KIM1+ PT cells and n = 1000 starting PST + PCT cells.Simulated group 1 and 2 are results of upregulating NFAT5, GLIS3, TCF12, MEF2A, and KLF6 in the starting PST + PCT subset.c.PCA plot of subsampled PT dataset, with n = 1000 starting KIM1+ PT cells and n = 1000 PST + PCT cells.Simulated group 1 and 2 are results of upregulating ESRRG, PPARA, RREB1, RORA, and HNF4A in the starting KIM1+ subset.TF upregulation is simulated by increasing expression by 5 (group 1) and 10 (group 2) standard deviations in the starting sample, calculated by each TF's expression in the PT dataset.Simulation results were similar over n = 3 independent trials.Supplementary Figure 13.Benchmarking RENIN predictions.a. r 2 calculated for RENINand Pando-predicted H-FR gene expression by training on Gerhardt 2023 multiomic dataset compared to target gene expression in an independent Kirita 2020 dataset for genes that were successfully modeled by both methods.For shared genes, mean RENIN r 2 was .068and mean Pando r 2 was .042.Source data are provided in the Source Data file.b. r 2 calculated for RENINand CellOracle-predicted H-FR gene expression by training on Gerhardt 2023 multiomic dataset compared to target gene expression in an independent Kirita 2020 dataset for genes that were successfully modeled by both methods.For shared genes, mean RENIN r 2 was .012and mean CellOracle r 2 was .0035.Source data are provided in the Source Data file.c.Number of H-FR differentially expressed genes modeled by each method by training on Gerhardt 2023 multiomic dataset.Source data are provided in the Source Data file.d-e.To measure similarity of TF predictions using different sizes of pseudocells, matrix of pairwise Jaccard indexes calculated for the top 20 model predicted (d) FR and (e) H TFs using target bin sizes of 1 (no pseudocell binning), 5, 10 cells per pseudocell.
Pairwise Jaccard index comparing top ten FR TFs between datasetsPairwise Jaccard index comparing top ten H TFs between datasets NFAT5 upregulation in PST and PCT cells increases FR expression phenotype.PCA plot of subsampled PT dataset, with n = 1000 starting PST and PCT cells and n = 1000 KIM1+ PT cells.Upregulation of NFAT5 is simulated in these starting healthy PST and PCT cells (Simulated group 1 and 2).Simulated group 1 and 2 are results of upregulating NFAT5 by 5 (group 1) and 10 (group 2) standard deviations in the starting PST + PCT sample, calculated by each TF's expression in the PT dataset.Simulation results were similar over n = 3 independent trials.NFAT5 in healthy PT cells